Multi-Core Software
Abstract
The rapid introduction of the Intel Core™2 Duo and Quad processors to the mass market has drawn attention to threadization (a.k.a. parallelization) and vectorization of existing code in many application domains. Multi-core processor vendors are eager to enable their users to exploit various levels of parallelism in order to harness the additional compute resources of multi-core processors. The Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel Core 2 Duo and Quad processors. It does so through high-level loop optimizations and scalar optimizations that exploit multi-core processors and single-instruction-multiple-data (SIMD) instructions, combined with advanced code generation built on an intimate knowledge of micro-architectural performance aspects. In this paper we outline the design and implementation of a new threadizer and vectorizer inside the Intel 10.1 compilers, and we provide an overview of the enhanced high-level loop optimizations and the low-level code generation used to obtain higher performance on platforms based on Intel Core 2 Duo and Quad processors. Significant performance gains are shown using the SPEC CPU2006 suite running on a system configured with two Intel quad-core processors.
INTRODUCTION
The aggressive delivery of Intel multi-core processors to the mass computer market shows that, as the performance improvements from continuously increasing clock frequencies start to taper off, other architectural advances that reduce latency or increase memory bandwidth are gaining importance [9]. In particular, since packaging densities are still growing, integrating multiple processors on a single die and using SIMD extensions are becoming more widespread [1].
The Intel Core 2 Duo and Quad processors are equipped with a rich set of microarchitectural and architectural features to boost performance:
• dual-core or quad-core on a single chip
• wider execution units for Streaming SIMD Extensions (SSE, SSE2, SSE3)
• a set of new instructions referred to as Supplemental Streaming SIMD Extensions 3 (SSSE3)
• advanced smart shared L2 cache among cores on the same chip
Due to the complexity of modern processors, compiler support has become an important part of obtaining higher performance. Most importantly, to help programmers leverage all the parallel capabilities of Intel’s new processors, the Intel C++/Fortran compiler provides an essential tool for unleashing the power of Intel multi-core processors and SIMD instructions by means of high-level optimizations and advanced code generation. The Intel compilers automatically optimize programs using threadization [10], vectorization [1, 2, 5], classical loop transformations (e.g., distribution, unrolling, interchange, fusion) [7, 11, 12], scalar optimizations such as constant propagation, Partial Dead Store Elimination (PDSE), Partial Redundancy Elimination (PRE), copy propagation, Inter-Procedural Optimizations (IPO) [7], and advanced machine code generation techniques that together yield a significant performance gain compared to the default level of optimization.
Intel Technology Journal, Volume 11, Issue 4, 2007. Inside the Intel 10.1 Compilers: New Threadizer and New Vectorizer for Intel Core™2 Processors.
The contributions of the new threadizer and vectorizer are as follows:
• The new threadizer yields up to a 4.63x speedup (with 8 cores) by exploiting thread-level parallelism from a serial program in the SPEC CPU2006 benchmark suites. Overall, auto-threadization delivers a 15.45% gain (geomean with 8 cores) for the SPEC CFP2006 suite and a 12.17% gain (geomean with 8 cores) for the SPEC CINT2006 suite.
• The new vectorizer yields up to a 1.28x speedup by exploiting SIMD-type vector parallelism from a serial program in the SPEC CPU2006 suites. Overall, auto-vectorization delivers a 5.11% gain (geomean) for the SPEC CFP2006 suite and a 2.01% gain (geomean) for the SPEC CINT2006 suite.
The rest of this paper is organized as follows. First, we provide some basics on the Intel Core™ microarchitecture. Then, we discuss the design and implementation of the new threadizer and vectorizer, respectively, inside the Intel 10.1 compilers. Subsequently, we discuss the loop optimizations and enhancements made to support efficient threadization and vectorization. We also present an overview of advanced code generation for the Intel Core 2 Duo and Quad processors. Finally, we provide performance results using the SPEC CPU2006 industry-standard benchmark suite built with the Intel 10.1 C++ and Fortran compilers.
INTEL CORE™ MICROARCHITECTURE
Intel Core micro-architecture is the foundation for all new Intel architecture-based desktop, mobile, and server multi-core processors. This state-of-the-art multi-core processor with optimized micro-architecture delivers a number of innovative features that have set new standards for energy-efficient performance. In this section we outline a few innovations relevant to this paper. A more detailed description can be found in the Intel literature [4].
Figure 1: Quad-core processor schematic
Figure 1 shows a schematic of the Intel Core 2 Quad processor. Two independent cores, each with its own private L1 cache, reside on a single die. Two shared Level 2 (L2) caches, referred to as the Intel Advanced Smart Cache, are each shared between cores so that data are stored in one place accessible to the cores. Sharing the L2 cache enables a core to dynamically use up to 100% of the available L2 cache, thus optimizing cache resources.
The quad-core processor is equipped with Intel Smart Memory Access, which boosts system performance by optimizing the use of available data bandwidth from the memory subsystem and hiding the latency of memory accesses through two techniques: memory disambiguation and an instruction pointer-based prefetcher that fetches memory contents into the shared L2 cache and then into each private L1 cache before they are requested. The data prefetcher can detect strided memory access patterns to make accurate predictions about future load addresses.
Another key feature of Intel Core micro-architecture is the Intel Advanced Digital Media Boost, which can issue 128-bit SSE instructions with a throughput of one per clock cycle. Previous-generation Intel processors had a sustained throughput of one instruction per two clock cycles, typically one cycle for the lower 64 bits followed by another cycle for the upper 64 bits. By widening the execution units to the full 128 bits, the processor effectively doubles the performance of a series of 128-bit SSE instructions relative to previous-generation Intel processors. In addition, the latency of various individual 128-bit SSE instructions has been reduced, and SSSE3 has been added to extend the instruction set. As a result, greater overall performance improvements can be expected from vectorization (i.e., transforming sequential code into SIMD instructions).
REVAMPING THE THREADIZER
In this section, we present our new threadizer framework, which is highly integrated with our classical high-level loop optimizations, and we describe its main components. The strengths of the new threadizer include the following:
• A new Abstract Thread Representation (ATR), based on the concept of virtual threads, is designed to bridge the semantic gap between the high-level representation and physical (hardware or OS) threads.
• Better interaction with other high-level loop-related optimizations gives better performance.
• The new threadizer has been moved downstream to take advantage of scalar optimizations such as global constant propagation and Static Single Assignment (SSA) PRE, and of some loop optimizations.
• A table-driven cost model simplifies maintenance and future extensibility.
• Effective runtime threadization control and multiple schedule types such as static, dynamic, guided, and runtime are supported.
The threadizer in the Intel compiler serves as a single module that covers different languages (C++ and Fortran), architectures (IA-32, Intel 64, and IA-64), and operating systems (Microsoft Windows*, Linux*, and MacOS*).
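The schedule types listed above (static, dynamic, guided, and runtime) have the same names as the loop schedule kinds in OpenMP. As a rough illustration only, the following hand-written OpenMP sketch in C shows the kind of loop a threadizer can target and where a schedule type enters. It is not the threadizer's actual output or internal representation; the function name `scale_add_sum` and the kernel itself are hypothetical and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical kernel (not from the paper): each iteration writes only c[i],
 * and the running sum is expressed as a reduction, so the iterations can be
 * divided among threads.
 *
 * Build with the compiler's OpenMP option to run the loop with threads.
 * Without that option the pragma is ignored and the loop runs serially.
 */
double scale_add_sum(const double *a, const double *b, double *c, int n)
{
    double total = 0.0;
    int i;

    assert(a != NULL && b != NULL && c != NULL && n >= 0);

    /* schedule(runtime) defers the choice of static, dynamic, or guided
       scheduling to the OMP_SCHEDULE environment variable. */
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (i = 0; i < n; i++) {
        c[i] = 2.0 * a[i] + b[i];
        total += c[i];
    }
    return total;
}
```

Calling `scale_add_sum(a, b, c, n)` fills `c` and returns the sum of its elements. Replacing `schedule(runtime)` with `schedule(static)`, `schedule(dynamic)`, or `schedule(guided)` fixes the policy at compile time instead; which one performs best depends on how evenly the work is spread across iterations.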
Similar Resources
Software Development for Parallel and Multi-Core Processing
The embedded software industry wants microprocessors with increased computing functionality that maintains or reduces space, weight, and power (SWaP). Single core processors were the key embedded industry solution between 1980 and 2000 when large performance increases were being achieved on a yearly basis and were fulfilling the prophecy of Moore’s Law. Moore’s Law states that “the number of tr...
Design of Multi-core Processor Software with Pipelining Strategy
At present, multi-core processors are widely used in many fields of IT, and hardware development is much faster than software development. Multi-core processor software still follows the design concepts of single-core processor software, so it is not high-performance and the performance of the processor is not fully exploited. This paper firstly introduces the trad...
Basic System-Level Software for a Single-Core MERASA Processor
In the EC FP-7 MERASA project a hard real-time capable multi-core processor is developed. The system-level software represents an abstraction layer between application software and embedded hardware. It has to provide basic functions of a real-time operating system. This report presents requirements for a multi-threaded hard real-time capable system-level software in embedded systems and the tr...
Early Timing Analysis of Vehicular Systems: the Road from Single-core to Multi-core
In software development for vehicular embedded systems, timing predictability is paramount for the development of the vehicles’ safety features and for their customer value. Modern vehicles’ features require a new level of computational power. On the one hand, multi-core platforms can provide efficient support for these features. On the other hand, multi-core platforms complicate the software ...
Modeling of Vehicular Distributed Embedded Systems: Transition from Single-core to Multi-core
Model- and component-based software development has emerged as an attractive option for the development of vehicle software on single-core platforms. Many challenges are encountered when the existing component models, which were originally designed for the software development of vehicular distributed single-core embedded systems, are extended for the software development on multi-co...
Modeling and visualizing networked multi-core embedded software energy consumption
In this report we present a network-level multi-core energy model and a software development process workflow that allows software developers to estimate the energy consumption of multi-core embedded programs. This work focuses on a high performance, cache-less and timing predictable embedded processor architecture, XS1. Prior modelling work is improved to increase accuracy, then extended to be...